simdjson talk

2022-03-26 ยท 4 min read

"Data Engineering at the Speed of Your Disk" - Daniel Lemire - 2020

  • Modern disks are pretty fast
    • PCIe 4 disks: 5 GB/s sequential read
  • Modern datacenter VPC network real fucking fast
    • 50 Gb/s (6.25 GiB/s) and better
  • If we can't eat data at many GB/s, we'll literally be CPU-bound reading from disk or network!
  • How fast can we allocate memory?
// almost instant (just virtual memory)
let mut buf = Vec::with_capacity(size);

// actually instantiate physical pages
for i in (0..buf.len()).step_by(page_size) {
	buf[i] = 0;
}

3.5 GB/s (Linux, Skylake 3.4 GHz, 4 KiB pages)

  • How fast can we, e.g., remove spaces from a string?
    • Assume 1% space density (minified json)
for i in 0..buf.len() {
	if buf[i] > 32 /* space as decimal = 32 */ {
		buf[i + 1] = buf[i];
	}
}

1.6 GB/s

  • Just working byte-by-byte will already have limit of ~3.5 GB/s b/c our processor is 3.5 GHz.
  • We're already slower than disk if we work byte-by-byte.
  • so how are we going to go faster? SIMD : )

AVX-512 instructions can operate on an entire cache-line at once : 0

  • Working with SIMD

  • (1) use SIMD intrinsics. produces very disgusting code though.

  • (2) pray / massage code so compiler auto-vectorizes : )

  • despacer with SIMD

// filled with spaces
__m128i spaces = _mm_set1_epi8(' ');

for (i = 0; i + 15 < size; i += 16) {
	// load next chunk
	__m128i x = _mm_loadu_si128(bytes + i);

	// set a 1 byte anywhere we have an x byte <= 32
	__m128i anywhite = _mm_cmpeq_epi8(spaces, _mm_max_epu8(spaces, x));

	// condense anywhite mask into a packed u64 mask
	uint64_t mask16 = _mm_movemask_epi8(anywhite);

	// basically a pdep, but deposit the non-space bytes at the front
	x = _mm_shuffle_epi8(x, despace_mask16[mask16 & 0x7fff]);

	// store in-place
	_mm_storeu_si128(bytes + pos, x);
	// increment our in-place output pointer only by the # non-spaces.
	pos += 16 - mask16.count_ones();
}
  • base64 encode / decode lemire/fastbase64

    • can encode w/ SIMD in only 3 instructions : 0 (mod loads/stores)
    • decoding closer to 7 instructions
    • once your data is not just in L1/L2 cache, this is just as fast as a memcpy
  • utf-8 encode / decode

  • byte-by-byte approach only 300 MB/s D :

if b1 < 0x80 {
	return true; // ascii
}
if b1 < 0xe0 {
	if b1 < 0xc2 || b2 > 0xbf {
		return false;
	}
} else if b1 < 0xf0 {
	// 3 byte form
	if b2 > 0xbf || (b1 == 0xe0 && b2 < 0xa0) || (b1 == 0xed && b2 >= 0xa0) {
		// ..
	} else {
		// ..
	}
} else {
	// 4 byte form
	// ..
}
  • using SIMD: 8 GB/s

  • using 32-byte chunks, ~20 instructions. branch-less

  • https://github.com/lemire/fastvalidate-utf-8

  • JSON for Modern C++ library (nlohmann-json)

  • 0.1 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)

  • rapidjson

  • 0.3 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)

  • both of these libs nowhere near disk/network speed

  • what is our "roofline" benchmark?

let mut all_line_lens = 0;
for line in reader.lines() {
	all_line_lens += line.len();
}
  • 1.4 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)

  • but you can actually go faster!

  • simdjson

  • 2.5 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)

  • some of the tricks used in simdjson

  • Find the span of the string

    • Given a mask containing quotes, return a mask set to 1 in the string span regions.
    • __1________1________1___1 quotes
    • ___11111111__________111_ string regions
mask = quote xor (quote << 1);
mask = mask xor (mask << 2);
mask = mask xor (mask << 4);
mask = mask xor (mask << 8);
// ..
  • string span can actually be done in one instruction : 0

  • Number parsing is expensive

    • strtod
    • 90 MB/s
    • 38 cycles / byte
    • 10 branch misses / float
    • yikes!
  • SIMD approach

fn contains_8_digits(buf: &[u8; 8]) -> bool {
	let val = u64::from_ne_bytes(buf);
	(   (val & 0xf0f0_f0f0_f0f0_f0f0) |
		(((val + 0x0606_0606_0606_0606) & 0xf0f0_f0f0_f0f0_f0f0) >> 4)
	) == 0x3333_3333_3333_3333
}

fn parse_8_digits_unrolled(buf: &[u8; 8]) -> u32 {
	let mut val = u64::from_ne_bytes(buf);
	val = ((val & 0x0f0f_0f0f_0f0f_0f0f) * 2561) >> 8;
	val = ((val & 0x00ff_00ff_00ff_00ff) * 6553601) >> 16;
	((val & 0x0000_ffff_0000_ffff) * 42949672960001) >> 32
}

https://github.com/lemire/fast-float-rust